;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;preparation.text

;;;--------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov        ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor                          ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035            ;;;
;;;                                                                          ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science.     ;;;
;;; All rights reserved. The RIACS Software Policy contains specific         ;;;
;;; terms and conditions on the use of this software, and must be            ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This        ;;;
;;; copyright and notice must be preserved in all copies made of this file.  ;;;
;;;--------------------------------------------------------------------------;;;

                        PREPARING DATA FOR AUTOCLASS

1.0 Introduction
1.1 Applicable Types of Data
1.2 Probability Models
1.3 Input Files
    1.3.1 Data File
    1.3.2 Header File
        1.3.2.1 Header File Example
    1.3.3 Model File
        1.3.3.1 Model File Example
1.4 Checking Input Files


1.0 Introduction

This documentation file is directed at anyone who will be preparing data
bases for AutoClass 3.0. It requires no statistics or Artificial Intelligence
background, just basic knowledge of the Lisp language.


1.1 Applicable Types of Data

AutoClass is applicable to observations of things that can be described by a
set of features or properties, without referring to other things. This allows
us to represent the observations by a data vector corresponding to a fixed
attribute set. Attributes are names of measurable or distinguishable
properties of the things observed. The data values corresponding to each
attribute are thus limited to be either numbers or the elements of a fixed
set of attribute specific symbols. With numeric data, a fixed measurement
error is assumed and must be provided with the attribute description.

AutoClass cannot express relationships between things because such
relationships are not a property of the thing itself. Nor can AutoClass deal
with properties expressed as sets of values.
However, the current models do allow for missing or unknown values. The
program itself imposes no specific limit on the number of data, but databases
having more than 10^4 attribute values may require excessive search time.

Note that there are techniques for re-expressing some data types in forms
acceptable to AutoClass. If a set valued property is limited to subsets of a
set of symbols, one can re-express the property as a set of binary
attributes, one for each of the possible symbols. Temporal ordering data can
be expressed as "time of (year, week, day)" or "time elapsed since ...". And
one can always indicate that a relation has been observed, even if the
related thing cannot be named. A simple example of the latter is the
transformation of `married-to' to `married?'.


1.2 Probability Models

The current models assume that attributes are conditionally independent given
the class. Thus within each class the probability that an instance of the
class will have a particular value of any attribute depends only on the class
and is independent of all other attribute values. The probability that the
class would produce any particular instance is then the product of the
individual attribute probability terms. At present it is not possible to
model relations between instances that are not conditional on the class
alone. This is a limit of the current likelihood model set and will be
corrected in a future release.

We use a multinomial model term for discrete attributes of nominal, ordered,
and circular subtypes (all are currently handled identically). This model
term allows any number of values, including missing. We use a Gaussian normal
model term for real numerical attributes, or any attribute representing
measurements. There are actually two versions, one of which allows for the
possibility of missing values. There is also an `ignore' model term for
attributes which are not to be considered in generating the classification.
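
The product rule described above can be sketched in a few lines of Lisp. This
is only an illustration of conditional independence within a single class,
not AutoClass code: the function names, the association-list representation
of the multinomial probabilities, and the parameter values are all
hypothetical.

```lisp
;;; Illustration only.  None of these names are AutoClass functions.

(defun multinomial-term (value probabilities)
  "Probability of a discrete VALUE, looked up in the alist PROBABILITIES."
  (cdr (assoc value probabilities)))

(defun normal-term (x mean sigma)
  "Gaussian density of X for a class with the given MEAN and SIGMA."
  (/ (exp (- (/ (expt (- x mean) 2) (* 2 sigma sigma))))
     (* sigma (sqrt (* 2 pi)))))

(defun instance-probability (terms)
  "Product of the per-attribute TERMS for one instance within one class."
  (reduce #'* terms :initial-value 1))

;; One class, one instance with a discrete and a real attribute.
(instance-probability
 (list (multinomial-term 'red '((red . 0.7) (white . 0.3)))
       (normal-term 30.0 30.0 2.0)))
```

Because each term depends only on the class and on its own attribute, adding
an attribute to the instance simply adds one more factor to the product.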
The set of currently available model terms is the value of
*model-term-types*, generated as the model files are loaded.


1.3 Input Files

An AutoClass data base resides in two files. There is a header file (default
type "hd2" from *header-file-type*) that describes the specific data format
and attribute definitions. The actual data values are in a data file (default
type "db2" from *data-file-type*). We use two files to allow editing of data
descriptions without having to deal with the entire data set. This makes it
easy to experiment with different descriptions of the database without having
to reproduce the data set. Internally, an AutoClass database structure is
identified by its header and data files, and the number of data loaded. The
set of currently loaded data bases may be found at *db-list*.

A classification of a data base is made with respect to a model which
specifies the form of the probability distribution function for classes in
that data base. Normally the model structure is defined in a model file
(default type "model" from *model-file-type*), containing one or more models.
Internally, a model is always defined relative to a particular database. Thus
it is identified by the corresponding database, the model's model file, and
its sequential position in the file. A specific model may be used by any
number of simultaneous classifications of the data base. A model file may be
used with any number of data bases to produce specific models for those
databases. See *model-list* for the currently loaded models.


1.3.1 Data File

The format of the data file is that of data objects (datum) terminated by the
end of the file. The number of values for each data object must be equal to
the number of attributes defined in the header file. There is an implied
#after each data object. Note that data objects may be either vectors, lists,
or groups of tokens delimited by #or #.

Missing attribute values in the data file may be represented by either 'nil,
?, or other symbols specified in the header file. The internal representation
of a missing value is 'nil for all data types. Individual data values may be
numbers (both integer and floating point), strings, or symbols. Any Lisp
readable object may be used as the value of an attribute which is typed as
'dummy in the header and is ignored by the models.

Example: (data-syntax :vector)

    #(7.8674307 33.311752 0.6008e03 10 1 1)
    #(5.3936334 30.08755 0.6634e03 4 2 1)
    #(6.838643 39.646942 0.6115e03 2 1 1)
    #(5.4278746 26.337687 nil 0 ? 1)

Example: (data-syntax :list)

    ("Dry Rot" 35.18797 0.5388601 3 1 1)
    ("Wet Rot" 26.803675 0.53074133 5 1 1)
    (nil 27.456902 0.5660058 ? 2 1)
    ("All Rot" 38.981537 0.62709737 7 1 1)

Example: (data-syntax :line)

    white 38.991306 0.54248405 2 2 1
    red 25.254923 0.5010235 9 2 1
    yellow 32.407973 nil 8 2 1
    all-white 28.953982 0.5267696 0 1 1


1.3.2 Header File

The header file specifies the data file format, the definitions of the data
attributes, and optional discrete attribute value translations. The value
translations for discrete type attributes provide a level of data
abstraction. For example, if the data values for an attribute are 1, 2, 3,
.. 9; and their meanings are "New York", "Chicago", "Los Angeles", ....; then
a translator can be defined such that the influence values report or the
cross-reference by class report (discussed in file reports.text) will use the
string names instead of the integers.

The header file contains function calls to DEFINE-DATA-FILE-FORMAT and
DEFINE-ATTRIBUTE-DEFINITIONS, and optionally to DEFINE-DISCRETE-TRANSLATIONS.
Note that if you are working in Symbolics Lisp or TI Explorer Lisp, then the
file's "mode line" package argument should be AUTOCLASS (*ac-pkg*) to get
function argument definitions, and to allow the file to be loaded rather than
read.
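
As an illustration of the city-name translation described above, a header
file might contain a call along the following lines. This is only a sketch:
the attribute index 0 and the use of strings as output forms are assumptions
(the example in 1.3.2.1 uses symbols), and only the three cities named in the
text are shown.

```lisp
;; Hypothetical: attribute 0 is assumed to be the discrete "city" attribute.
;; Each pair is (input-form output-form).
(define-discrete-translations
  '((0 (1 "New York") (2 "Chicago") (3 "Los Angeles"))))
```
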

The header file functional specification follows:

(DEFINE-DATA-FILE-FORMAT                            *** REQUIRED ***
  (&key number-of-attributes
        (separator-chars '(# ))       ;; add other characters, as needed
        (comment-chars '(# ))         ;; add other characters, as needed
        (unknown-tokens '(? nil))     ;; add other symbols, as needed
        (data-syntax :line)           ;; one of :vector, :list, or :line
        (data-base *input-data-base*)))  ;; used only when called directly

Note: :separator-chars, :comment-chars, & :unknown-tokens, when specified, do
not need to include the default characters.

(DEFINE-ATTRIBUTE-DEFINITIONS                       *** REQUIRED ***
  '<Attribute Descriptors List>)

Attribute Descriptors declare how to interpret attribute values. A descriptor
applies to an attribute (index), or a list of attribute indices, or to the
symbol 'default. Duplication of an attribute index will cause a break.
Omitted attributes will either receive the specified default or be declared
to 'dummy. A warning message will be generated by the AutoClass file reading
functions for any unspecified attributes which are set to 'dummy.

Each descriptor is a list of:

    Attribute index (zero based), or list of indices, or 'default.
    Attribute type. Must be a property indicator in the list *att-type-data*.
    Attribute sub-type. Must be an indicator in the property value of type.
    Attribute description string.
    List of type and subtype specific property type and value pairs.
      See *att-type-data* for the available properties. Others will be
      added.

Currently available combinations:

    type        sub-type    property type(s)
    ----        --------    ----------------
    dummy       none        --
    real        location    error
    real        scalar      zero-point rel-error
    real        scalar      zero-point error
    discrete    nominal     range
    discrete    ordered     range ordering
    discrete    circular    range ordering

An example is given in 1.3.2.1. Note that the last three combinations will be
handled identically, until appropriate specializations of the multinomial
model have been developed. The value of *Att-type-data* gives the relations
that are currently in effect.
The commented out portions of the definition indicate possible future
directions in attribute type representations.

(DEFINE-DISCRETE-TRANSLATIONS                       *** OPTIONAL ***
  '<Discrete Attribute Translations List>)

This only applies for 'discrete type attributes, and will optionally be
constructed for you from the data, if not supplied. However, the data
abstraction feature will then not be available.

<Discrete Attribute Translations List>:

Each translation is a list of:

    Discrete Attribute index (zero based), or list of indices, or 'default.
    Zero or more pairs of input-form output-form translators.

Alternately, the translations pair list may be given as the 'translations
property of the attribute definition descriptor. Such translations will take
precedence.


1.3.2.1 Header File Example

(define-data-file-format
  :number-of-attributes 25
  :separator-chars '(# )
  :comment-chars '(#||)
  :unknown-tokens '(unk)
  :data-syntax :vector)

(define-attribute-definitions
  '((0 dummy nil "True class, range = 1 - 3" (range 3))
    (1 real location "X location, m. in range of 25.0 - 40.0" (error .25))
    (2 real location "Y location, m. in range of 0.5 - 0.7" (error .05))
    (3 real scalar "Weight, kg. in range of 5.0 - 10.0"
       (zero-point 0.0 rel-error .001))
    (4 discrete nominal "Truth value, range = 1 - 2" (range 2))
    (5 discrete nominal "Color of foobar, 10 values" (range 10))
    (6 discrete ordered "Spectral color group"
       (range 6 ordering (r o y g b v)))
    (7 discrete circular "Points of Compass"
       (range 8 ordering (N NE E SE S SW W NW)))
    ((8 9 10 11 12 13 14 15 16 17 18 19 20) real scalar "spectral intensity"
       (zero-point 0.0 rel-error .001))
    (default discrete nominal "logical noise" (range 2))))

(define-discrete-translations
  '((5 (n brown) (b buff) (c cinnamon) (g gray) (r green)
       (p pink) (u purple) (e red) (w white) (y yellow))
    (6 (r red) (o orange) (y yellow) (g green) (b blue) (v violet))
    (4 (1 false) (2 true))
    (7 (N North) (NE Northeast) (E East) (SE Southeast)
       (S South) (SW Southwest) (W West) (NW Northwest))
    (default (0 false) (1 true))))


1.3.3 Model File

The model file contains data describing the model(s) that will be used for
the classification. This file is read, not loaded. Each model is specified by
a list of model group lists. Each model group list associates some attributes
with a model term type.

Each model group list consists of:

    An interaction term type (one of *model-term-types*).
    Zero or more attribute set lists of attribute indices, or the symbol
      'default.

Notes:

    At least one model description list is required.
    There may be multiple entries in a model for any model term type.
    An attribute index alone is equivalent to a singleton attribute set.
    An attribute index must not appear more than once in a model list.
    Ignore is not a valid 'default model term type.

*Model-Term-Types* currently looks like this:

    (single-multinomial single-normal-cn single-normal-cm ignore)

See the corresponding "model-<model-term-type>.lisp" file for detailed model
descriptions.

    Single-Multinomial models discrete attributes as multinomials.
    Single-Normal-cn models real valued attributes as normals.
    Single-Normal-cm models real valued attributes with missing values.
    Ignore allows the model to ignore an attribute.


1.3.3.1 Model File Example

A model list suitable for the above header file follows. Note that since all
of the current model terms take single attributes, single indices have been
substituted for the attribute set lists needed for multiple terms:

((ignore 0)
 (single-normal-cn 1 2 3)
 (single-multinomial 4 5 6 7)
 (single-normal-cm 'default)
 (single-multinomial 21 22 23 24))

The following illustrates how multiple attribute terms will be handled:

((ignore 0)
 (single-normal-cn 1)
 (multi-normal-cn (2 3))
 (single-multinomial 4 7)
 (joint-multinomial (5 6) (21 22 23 24))
 (sparse-multi-normal-cm (8 9 10 11 12 13 14 15 16 17 18 19 20)))


1.4 Checking Input Files

A function named AUTOCLASS-INPUT-CHECK is provided to check the validity of a
set of data, header, and model files without initiating a classification
search. Thus errors and warnings can be dealt with prior to beginning the
search, hopefully contributing to a more useful classification search. A
history of the error and warning messages is saved, by default, in a log
file.

The input argument key-word list of this function is:

    data-file (header-file "") (model-file "") (log-file-p t)
    output-files-default (reread t) (regenerate t) n-data

It reads and returns the data base and model(s) defined by :data-file,
:header-file, and :model-file. The :data-file value must be a fully qualified
pathname. The :header-file value can be a fully qualified pathname, a file
name (its root will default to that of :data-file), or not provided at all
(its pathname will default to that of :data-file). The :model-file behaves in
the same manner as :header-file.
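
A call using the pathname defaulting just described might look like the
following sketch. It assumes that AutoClass is loaded and that its package is
the current one; the directory and file names are hypothetical, and only the
function name and keyword arguments are taken from the text above.

```lisp
;; Hypothetical paths and file names, for illustration only.
(autoclass-input-check
  :data-file "/users/me/autoclass/data/rot.db2"  ; fully qualified pathname
  :header-file "rot"   ; file name only: root defaults to that of :data-file
  :model-file "rot"    ; resolved in the same manner as :header-file
  :log-file-p t)       ; the default: save messages in a log file
```

Omitting :header-file and :model-file altogether would make both pathnames
default to that of :data-file.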

File name extensions (file types) for :data-file, :header-file, and
:model-file are forced to canonical values by the AutoClass program:

    :data-file      "db2"     (defined by *data-file-type*)
    :header-file    "hd2"     (defined by *header-file-type*)
    :model-file     "model"   (defined by *model-file-type*)

If :log-file-p is t and :output-files-default is nil, the log file will be
named by default "<data-file-name>&<header-file-name>&<model-file-name>", and
will have the same root as :data-file. Specifying :output-files-default as a
file name (e.g. "my-log-file") will override the default name. Specifying it
as a path name (e.g. "my-record-dir/my-log-file") will override the pathname
of :data-file, as well. If :log-file-p is nil, then no log file is generated.

The log file is created with keyword options :if-exists :append and
:if-does-not-exist :create, so that multiple sessions of
AUTOCLASS-INPUT-CHECK will result in only one log file, as long as only the
version numbers, and not the file names, of :data-file, :header-file, and
:model-file change. The file extension of the log file is forced to "log"
(defined by *log-file-type*).

Note that :output-files-default is also an argument to AUTOCLASS-SEARCH and
GENERATE-CLSF, so it can be used to give all your output files ("log",
"search", and "dump"/"results") consistent names for a particular
classification run.

The switches :reread and :regenerate are defaulted to t to force complete
re-reading of the data file, and re-generation of the models. This
incorporates changes and corrections which you make to the data file, the
model file, and the header file.

N-data, if supplied, allows the reading of less than the full data file. This
is useful when the data file is very large and you are just interested in
validating the header and model file contents. However, once this is
accomplished, invoke the function CLEAR-STORES to clear out the stored short
data base.
This assures that subsequent processing using either AUTOCLASS-INPUT-CHECK,
AUTOCLASS-SEARCH, or GENERATE-CLSF will reload the complete data base and
regenerate the model.

All advisory, warning, and error messages are output to the screen, and also
to the log file provided that the :log-file-p argument is t (the default).
Advisory messages are output to provide information which is not crucial to
the continuance of the run. Warning messages contain information which may
affect the quality of the run. However, the default condition is to NOT stop
the run when one or more warning messages are generated. The Common Lisp
global variable *break-on-warning* controls this functionality: binding this
variable to t will cause AUTOCLASS-SEARCH or GENERATE-CLSF to "break" on
warning messages. The function AUTOCLASS-INPUT-CHECK does not generate a
classification or invoke a search, hence it is finished when it outputs any
warning messages. Error messages are fatal, and if generated during the
invocation of GENERATE-CLSF or AUTOCLASS-SEARCH, the run will be terminated.
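
The partial-load check described in this section could be sketched as
follows. It assumes that AutoClass is loaded and that its package is the
current one. The path and the :n-data value are hypothetical, and calling
CLEAR-STORES with no arguments is an assumption: the text names the function
but does not give its argument list.

```lisp
;; Validate the header and model files against part of a large data file.
(autoclass-input-check
  :data-file "/users/me/autoclass/data/rot.db2"  ; hypothetical path
  :n-data 100)                                   ; hypothetical datum count

;; Discard the stored short data base so that later processing reloads the
;; complete data file.  (Zero-argument call is an assumption.)
(clear-stores)

;; Make later AUTOCLASS-SEARCH or GENERATE-CLSF runs break on warnings.
(setq *break-on-warning* t)
```
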